
Conversation

@lzhangzz (Collaborator)

In #4223, the change of scheduling policy causes more cache misses when long, repeated requests are sent in a burst. This PR fixes it by

  1. Allowing blocks from partially prefilled sequences to be cached (see the sketch below).
  2. Allowing partially prefilled sequences to match against the cache trie.
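
As a rough illustration of the relaxed rule, here is a minimal sketch; it is not the actual TurboMind code, and `Seq`, `CanCachePrefix`, and the field names are simplified stand-ins:

```cpp
#include <vector>

// Simplified stand-in for the engine's sequence state (hypothetical fields).
struct Seq {
    std::vector<int> prompt;     // prompt token ids
    std::vector<int> blocks;     // KV cache block ids backing the sequence
    int              cache_len;  // number of prompt tokens already prefilled
};

// May this sequence contribute prefix blocks to the trie?
bool CanCachePrefix(const Seq& seq, int block_seq_len)
{
    // Old policy: only fully prefilled sequences qualified, i.e. it required
    // seq.cache_len >= seq.prompt.size().
    // New policy: any fully filled leading block is cacheable, even while
    // cache_len < prompt.size() (a partially prefilled sequence).
    const int full_blocks = seq.cache_len / block_seq_len;
    return !seq.prompt.empty() && full_blocks > 0
           && full_blocks <= static_cast<int>(seq.blocks.size());
}
```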

Copilot AI (Contributor) left a comment

Pull request overview

This PR fixes a performance regression in prefix caching introduced by #4223 that caused increased cache misses when sending long and repeated requests in bursts. The fix enables caching and matching of prefix blocks from partially prefilled sequences.

Changes:

  • Modified the `CachePrompt` and `PrefixMatch` functions to allow caching and matching of blocks from partially prefilled sequences (where `cache_len < prompt.size()`)
  • Changed the attribute on `Materialize` from `[[nodiscard]]` to `[[maybe_unused]]`, since the return value is not always used (illustrated after this list)
  • Replaced `FT_CHECK` macros with `TM_CHECK` variants for consistency
  • Refactored code formatting and variable initialization in several files
  • Added an `async_` parameter to the benchmark script
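
To illustrate the attribute change on `Materialize` (signatures simplified, not the real declaration): `[[nodiscard]]` makes discarding the return value a compiler warning, while replacing it with `[[maybe_unused]]` lets callers ignore the result silently.

```cpp
[[nodiscard]]    inline bool MaterializeOld() { return true; }
[[maybe_unused]] inline bool MaterializeNew() { return true; }

int main()
{
    // MaterializeOld();  // discarding the result would trigger a compiler warning
    MaterializeNew();     // fine: no [[nodiscard]], result may be ignored
}
```

Strictly speaking, `[[maybe_unused]]` marks the function itself as potentially unused; it is the removal of `[[nodiscard]]` that permits discarding the result.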

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| src/turbomind/models/llama/SequenceManager.h | Changed `Materialize` attribute from `[[nodiscard]]` to `[[maybe_unused]]` |
| src/turbomind/models/llama/SequenceManager.cc | Enabled caching/matching for partially prefilled sequences; replaced `FT_CHECK` with `TM_CHECK`; cleaned up formatting |
| src/turbomind/models/llama/BlockTrie.cc | Refactored the `Match` and `Cache` functions with clearer variable names and improved control flow |
| src/turbomind/models/llama/BlockManager.h | Removed an unused function declaration |
| src/turbomind/models/llama/BlockManager.cc | Replaced `FT_CHECK` with `TM_CHECK`; removed the unused function; added timestamp initialization |
| src/turbomind/engine/engine.cc | Removed unused variables and unnecessary lambda captures |
| benchmark/profile_pipeline_api.py | Added `async_` parameter support |


```cpp
if (auto& seq = *sequences[i]; !seq.prompt.empty()) {
    BlockIds  block_ids;
    UniqueIds block_unique_ids;
    std::tie(block_ids, block_unique_ids) = block_trie_->Cache(seq, seq.prompt);
```
Copilot AI · Jan 15, 2026

The condition `!seq.prompt.empty()` should also check that there are blocks available to cache. When `seq.cache_len < seq.prompt.size()` but `seq.blocks` is empty or insufficient, attempting to cache could lead to unexpected behavior. Consider checking `seq.blocks.size()` to ensure blocks exist before caching.

Suggested change:

```diff
-if (auto& seq = *sequences[i]; !seq.prompt.empty()) {
+if (auto& seq = *sequences[i]; !seq.prompt.empty() && !seq.blocks.empty()) {
```

```cpp
}

/// TODO: `Unlock` and `Lock` can't be batched because there may be repeated blocks between sequences
if (const int offset = seq.cache_len / block_seq_len_; offset < block_ids.size()) {
```
Copilot AI · Jan 15, 2026

The offset calculation using integer division could lose data if `seq.cache_len` is not perfectly aligned with `block_seq_len_`. Consider adding a check or comment to clarify the expected behavior when `seq.cache_len % block_seq_len_ != 0`.

Suggested change:

```diff
-if (const int offset = seq.cache_len / block_seq_len_; offset < block_ids.size()) {
+// NOTE: seq.cache_len is expected to be aligned to block_seq_len_. The integer division below
+// intentionally converts from cached tokens to a whole-block offset; any partial block (non-zero
+// remainder) is ignored here and will be effectively truncated.
+const int offset = seq.cache_len / block_seq_len_;
+if (seq.cache_len % block_seq_len_ != 0 && rank_ == 0) {
+    TM_LOG_DEBUG(
+        "[SeqMgr][match] ID %llu, cache_len %d not aligned to block_seq_len_ %d; truncating to %d blocks",
+        seq.id, seq.cache_len, block_seq_len_, offset);
+}
+if (offset < static_cast<int>(block_ids.size())) {
```
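
For intuition, here is a tiny self-contained check of the division discussed above; the block size is a made-up value, not TurboMind's default:

```cpp
#include <cassert>

int main()
{
    const int block_seq_len = 128;     // hypothetical tokens per KV block
    assert(384 / block_seq_len == 3);  // aligned: exactly 3 full blocks
    assert(385 / block_seq_len == 3);  // the 1-token remainder is truncated
}
```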

```cpp
b.use_count = 1;
b.unique_id = unique_id_++;
FT_CHECK(is_active(b));  // post-condition
b.timestamp = timestamp_++;
```
Copilot AI · Jan 15, 2026

The timestamp is now set during `Allocate`, which is a new addition. Setting it at allocation time is logical, but verify that this does not interfere with the LRU eviction logic, especially when blocks are allocated but not immediately used.
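
For context, here is a minimal sketch of how timestamp-based LRU eviction typically picks a victim; this is an assumed illustration, not TurboMind's actual eviction code. A block stamped at `Allocate` but never touched again ages like any other cold block:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Block {
    int      id;
    uint64_t timestamp;  // bumped on allocation and on each subsequent use
};

// Returns the id of the least-recently-stamped block, or -1 if none exist.
int PickEvictionVictim(const std::vector<Block>& blocks)
{
    auto it = std::min_element(blocks.begin(), blocks.end(),
                               [](const Block& a, const Block& b) {
                                   return a.timestamp < b.timestamp;
                               });
    return it == blocks.end() ? -1 : it->id;
}
```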

@lvhan028 merged commit ae2d004 into InternLM:main on Jan 15, 2026
15 checks passed
